Red Wine Quality by Lauran Hazan

Setups:

Settings to make knitted HTML readable (thank you, first project reviewer!)

Load libraries

Load data

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
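
The output above has the shape of what `names()`, `str()` and `summary()` print for a data frame. Here is a minimal sketch of that loading step. The file name `wineQualityReds.csv` is an assumption (the original chunk isn't shown), so the first six rows printed later in this report are inlined to make the sketch run on its own:

```r
# Assumption: on disk the data would be read with something like
#   wine <- read.csv("wineQualityReds.csv")   # hypothetical file name
# The six rows below are the ones printed by head() later in this report.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
4,11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
5,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
6,7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5"
wine <- read.csv(text = csv_text)

print(names(wine))    # column names
str(wine)             # type and first values of each column
print(summary(wine))  # min, quartiles, mean and max of each column
```

With only six rows the numbers will of course differ from the full 1599-row output.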

Introduction to the data set:

Dataset doc: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

This dataset contains 1599 records and 13 variables, which describe red wine characteristics as well as a quality rating based on scores from at least 3 wine experts.

Univariate Plots Section

Let’s see what the quality variable looks like:

Let’s create some categorical variables. Let’s classify wines as sweet or not (residual sugar of 45 g/L or more), and let’s also create one for high-sulfite wines (>62.00). SO2 (sulfur dioxide) is what the dataset measures for sulfites, and sulfites are usually identified as the cause of the allergy-like symptoms many people get from wine (reference: ‘Skinny Bitch’, Freedman & Barnouin). More cool info on the what and why of sulfites in wine here: https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/

## 
## Normal  Sweet 
##   1502     97
## 
##   High Normal 
##     47   1552
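
Two-level variables like the ones tabulated above can be built with `ifelse()`. This is only a sketch: the cutoffs are the ones quoted in the text, and they are assumptions about what the original code used. On the full data they would not reproduce the counts above (the summary shows residual sugar topping out at 15.5 g/L), so treat them as placeholders. The ten values per column are the ones visible in the `str()` output.

```r
# First ten values of each column, copied from the str() output above.
wine <- data.frame(
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Assumed cutoffs, taken from the text; not confirmed against the original code.
sweet_cutoff   <- 45  # g/L of residual sugar
sulfite_cutoff <- 62  # total sulfur dioxide

wine$sweet      <- ifelse(wine$residual.sugar >= sweet_cutoff, "Sweet", "Normal")
wine$hi.sulfite <- ifelse(wine$total.sulfur.dioxide > sulfite_cutoff, "High", "Normal")

print(table(wine$sweet))
print(table(wine$hi.sulfite))
```

Keeping the cutoffs in named variables makes it easy to try other values and see how the group sizes change.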

This is actually kind of interesting - the high-sulfite wines do show a different distribution of quality ratings. None are rated 3 and very few are rated 4. Maybe this is because sulfites keep wine from going bad, so lower-sulfite wines may tend to be spoiled more often. But the proportion of higher-quality ratings (above 5) is much lower than for the other wines. This matches my experience - I’m among the people (estimated at up to 10% of wine drinkers) who have unpleasant reactions to sulfites, including itching, skin redness and quick-onset headache. High-sulfite wines tend to have an overpowering smell for some due to this sensitivity - so it makes sense they’d be rated a bit lower.

Let’s do the same for sweet wines:

This is unexpected! Most of my red wine drinking friends claim to dislike sweeter wines (same with me) - and yet, we see the distribution of the higher-quality ratings spreads further into higher ratings for the sweet wines, whereas the normal wines rated above 5 are mostly rated 6!

After initial look, let’s see what some of the other distributions look like:

There are some outliers in some of these variables - let’s try limiting the x-axis to see if we can get a better view of these distributions:

Let’s take a closer look at residual sugar and chlorides:

These are indeed right-skewed distributions. These variables may or may not have a relationship with quality rating, but we’d need to be careful building models, as they aren’t normally distributed.
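
A quick numeric check for right skew is that the mean sits above the median (in the summary, residual sugar has a mean of 2.539 against a median of 2.200). The sketch below adds a moment-based skewness and a log10 transform. The `skewness` helper is written here for illustration and is not from the original analysis. It uses only the ten residual sugar values visible in the `str()` output.

```r
# Illustrative helper: moment-based skewness.
# Positive values mean a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# First ten residual.sugar values from the str() output.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

raw_skew <- skewness(residual.sugar)
log_skew <- skewness(log10(residual.sugar))

cat("mean:", mean(residual.sugar), " median:", median(residual.sugar), "\n")
cat("skewness raw:", raw_skew, " after log10:", log_skew, "\n")
```

On these values the log transform reduces the skew but does not remove it.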

Univariate Appendix:

In the bivariate and multivariate sections of this analysis, I created some new variables. In this section, we’ll take a look at them on their own:

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality  sweet hi.sulfite quality.f total.pres
## 1       5 Normal     Normal         5       0.56
## 2       5 Normal     Normal         5       0.68
## 3       5 Normal     Normal         5       0.69
## 4       6 Normal     Normal         6       1.14
## 5       5 Normal     Normal         5       0.56
## 6       5 Normal     Normal         5       0.56

Let’s take a look at the new total preservatives:

The histogram shows that there are some outliers. Let’s see what they are:

I’m going to handle these by limiting the x axes to the range that covers 98% of the values:
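
One way to get such limits is from quantiles. The sketch below reads "98% of values" as the central 98% (1st to 99th percentile). That reading is an assumption, and `central_range` is a hypothetical helper name. The vector is toy data, not wine data.

```r
# Hypothetical helper: limits that cover the central `coverage` share of x.
central_range <- function(x, coverage = 0.98) {
  tail_prob <- (1 - coverage) / 2
  quantile(x, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
}

x <- c(1:99, 1000)  # toy values with one extreme outlier, NOT wine data

limits <- central_range(x)
kept   <- x[x >= limits[1] & x <= limits[2]]

print(limits)
print(length(kept))
# The limits can then be used as axis limits, e.g. hist(x, xlim = limits)
```

Using the limits only on the axis leaves the data untouched, which keeps the outliers available for later steps.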

Univariate Analysis

What is the structure of your dataset?

1599 obs. of 13 variables (not including the ones that I created)

What is/are the main feature(s) of interest in your dataset?

Overall, the important result variable here is quality. What we’re looking for is ultimately the variables which can affect the quality of wine. Maybe this can help us find some wine store bargains or at least help us sound smart at fancy parties :)

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I don’t know too much about red wine, but I don’t like sweet wines and I have allergy-like reactions to some red wines, which, according to my reading (referenced above), are likely attributable to sulfites (SO2 and SO3).

Another interesting feature is alcohol content. There’s a common claim I have heard often that alcohol content is a good proxy for wine quality. I don’t know if it’s true, but maybe we can test that in this EDA exercise!

The excellent explanation below (reference at the end of the list) gives more detail on each of the variables in the dataset:

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

REFERENCE: This summary came from the reference file provided by Udacity: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Did you create any new variables from existing variables in the dataset?

Yes. I created three new variables: 1. sweet: classifies wines as sweet or not (residual sugar of 45 g/L or more) - according to the reference documentation for the data set, that’s the standard for whether a wine is considered sweet. 2. hi.sulfite: classifies wines by total.sulfur.dioxide content - in this case, I took the wines with free sulfites higher than 40 (based on documentation above). 3. total.pres: a total preservatives variable (see further below in the analysis).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Most of the variables were roughly normally distributed. The ones that were not all had similar long tails to the right (right-skewed), as seen above for residual sugar and chlorides.

Bivariate Plots Section

Based on the analysis so far, it seems we need to take a closer look at alcohol content and volatile acidity. Just to be sure, let’s take another look at variable correlations:

##                             [,1]
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632
## quality               1.00000000
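
A one-column table like the one above, with its `[,1]` header, is what `cor()` returns when a data frame is correlated against a single vector. The sketch below shows the call on a few columns of the six rows printed earlier. Exactly which columns the original call used is an assumption.

```r
# A few columns from the six rows printed by head() earlier in this report.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Pearson correlation of every column with quality.
# The result is a matrix with one unnamed column.
quality_cor <- cor(wine, wine$quality)
print(round(quality_cor, 3))
```

Six rows are far too few to say anything about the wines. The point is only the shape of the call and of its result.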

From the above, as well as the objectives of the analysis (find bargains and sound smart), it makes sense to focus on variables that we can actually use (remember, we’re reading wine labels in the store, not doing chemical tests) AND that actually seem to have some linear relationship to quality! That narrows it down nicely: it leaves us with alcohol, sulphates and citric acid. While volatile acidity does seem to have a relationship, there are two problems with this variable: 1. It’s apparently less likely to develop in higher-SO2 wines (reference: https://winemakermag.com/676-the-perils-of-volatile-acidity), so it may not be totally independent. 2. It’s not on the wine label. HOWEVER, since many wine reviews and signs at wine stores talk about acidity, let’s leave it in there and see what happens.

Let’s create some scatter plots to look at these relationships a bit:

We can see that alcohol has the strongest correlation with quality when we run the cor function (about 0.48) - but looking at the box plot, the pattern isn’t consistent: the lower quality ratings don’t show much of a relationship, but the higher ones do. Let’s take a closer look:

This is interesting - it’s sort of a histogram of histograms for each quality rating. It tells us a few things: First that the boxplots above are a bit visually misleading - the amount of data for quality levels 3 and 8 is very small, so the eye is perhaps seeing a stronger pattern than is there.
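
The "how much data is behind each box" question can also be answered with a count and a median per quality level. This is a sketch on the ten alcohol and quality values visible in the `str()` output. Defining `quality.f` as `factor(quality)` is an assumption, since its definition isn't shown.

```r
# First ten alcohol and quality values from the str() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumption: quality.f is simply quality turned into a factor.
wine$quality.f <- factor(wine$quality)

n_by_quality   <- table(wine$quality.f)                         # wines per rating
median_alcohol <- tapply(wine$alcohol, wine$quality.f, median)  # median per rating

print(n_by_quality)
print(median_alcohol)
```

Reading the two outputs side by side makes it harder to over-interpret a box that rests on a handful of wines.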

Let’s see whether models fit to all ratings and to just the middle ones look the same:

Interesting! Indeed our middle-rated wines show a different pattern: alcohol content has a fairly strong positive relationship with quality rating only for alcohol contents between about 9.5% and 12.25%. Outside of those bounds, the relationship might actually be negative! When we include all quality levels, this phenomenon disappears.

Let’s do the same for Volatile Acidity:

While we can see that the middle-rated wines’ model looks a little different than the one built on all wines, they both follow similar overall patterns (unlike for alcohol content) - This is a bit of a nicer way to show what the box plot shows: more VA will tend to hurt a wine’s quality rating.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Unsurprisingly, volatile acidity is negatively correlated with quality - higher VA basically means the wine is closer to being vinegar, which is obviously not what most people are looking for when they drink wine.

What was the strongest relationship you found?

Alcohol content was overall strongest, but volatile acidity was also pretty strong. We also see apparent relationships between quality and both citric acid and sulphates. I did some reading on these last two variables and we need to explore their relationship to volatile acidity.

Multivariate Plots Section

Citric acid and sulphates are preservatives that are specifically used for controlling wine quality and freshness - i.e. keeping it from turning into vinegar. I created a new variable for total preservatives (total.pres) so we could simplify our analysis of the relationship between VA and the chemicals used to control VA!
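
The definition of total.pres isn’t shown, but the six rows printed in the Univariate Appendix are consistent with it being the sum of citric.acid and sulphates. A small sketch that checks this against those rows:

```r
# citric.acid and sulphates from the six rows printed in the Univariate Appendix.
wine <- data.frame(
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56)
)

# Assumed definition: total preservatives = citric acid + sulphates.
wine$total.pres <- wine$citric.acid + wine$sulphates

# total.pres values as printed for the same six rows.
shown <- c(0.56, 0.68, 0.69, 1.14, 0.56, 0.56)

print(wine)
print(all.equal(wine$total.pres, shown))
```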

Reading further in this article: https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/ I came across a claim that might help to explain why we aren’t seeing any relationship between the anti-oxidant preservative sulfites and VA - higher alcohol and lower pH means sulfites are less needed to prevent oxidation. (screw caps can also help here - who knew!?)

What stands out here is that quality levels 5, 6 and 7 seem to behave similarly. The lower and higher values are different and also have much less data, so we need to be careful when drawing conclusions about the entire dataset. With wines rated 6 or 7, it does look like maybe higher pH means that more sulfites are likely to be used. This is consistent with the wine making article, but we need to be careful - the number of high-sulfite wines isn’t very large.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

This part of the analysis actually allowed us to simplify how we view variables with a relationship to quality. If we combine this with some reading about wine-making, it quickly becomes evident that many variables in the data set are related - for example, density and residual sugar. Dissolved sugar raises the density of the liquid (while alcohol lowers it), so it stands to reason that there is a relationship here. Also, many of the chemical additives in wine are acids - e.g. citric acid. So the pH level of the wine should be influenced by the presence of these.

Were there any interesting or surprising interactions between features?

Preservatives are used to control fermentation and keep the VA down, so it makes sense that citric acid also shows a relationship. What is surprising is that sulfur dioxide - also a preservative used to keep VA down - is not showing a strong relationship to quality! This is surprising because, when you google how to deal with wine acidity, the standard advice on wine websites seems to be to use more sulfur dioxide.

So here is the problem: If we combine preservatives, we find that they have a pretty strong positive relationship with quality. HOWEVER, preservatives are what minimize volatile acidity - which has a strong negative relationship with quality. So we shouldn’t assume that preservatives predict quality - there is likely a lot of dependency between variables here and common sense would lead us to conclude that it’s the volatile acidity that matters to quality.
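
One way to probe this dependency, which the analysis above does not do, is a partial correlation: correlate what is left of total.pres and of quality after a linear fit on volatile acidity has been removed from each. The sketch below uses the ten rows visible in the `str()` output. The `partial_cor` helper is hypothetical, and ten rows are far too few to draw conclusions from.

```r
# First ten values of each column from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine$total.pres <- wine$citric.acid + wine$sulphates  # assumed definition

# Hypothetical helper: correlation of x and y after removing the
# linear effect of z from both (partial correlation via residuals).
partial_cor <- function(x, y, z) {
  cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
}

raw     <- cor(wine$total.pres, wine$quality)
partial <- partial_cor(wine$total.pres, wine$quality, wine$volatile.acidity)

cat("raw correlation:    ", raw, "\n")
cat("partial correlation:", partial, "\n")
```

If the partial correlation on the full data came out much smaller than the raw one, that would support the reading that volatile acidity is what matters.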

Other surprises: 1. Sweeter wines showed a distribution shifted toward higher quality ratings, even though the linear correlation between residual sugar and quality is close to zero (0.01). This is unexpected, as usually we think of drier wines as higher quality. 2. The sulfites analysis didn’t reflect the expert’s advice for wines rated a 5. But this may have been because of a very small number of high-sulfite wines in that category. However, this would be an area to look at in the full analysis.


Final Plots and Summary

Plot One: Alcohol Content and Quality:

Description One: Alcohol content does seem to have a positive relationship with quality rating. When we remove the extreme ratings and focus on those in which most of the wines fall (5-7), the pattern is more pronounced. However, it’s important to note that there is still lots of variability in alcohol content for each quality rating.

Plot Two: Volatile Acidity and Quality:

Description Two: We can see a clear relationship between volatile acidity and quality ratings. This makes sense intuitively - no one likes drinking wine that tastes like vinegar!

Plot Three: Total Preservatives vs. Volatile Acidity:

Description Three:

This plot is important because it explains the reason why we dropped the analysis on sulphates and citric acid: They are clearly related to volatile acidity. This isn’t surprising - preservatives are used expressly for the purpose of controlling wine quality.


Reflection

The most important lesson I learned with this analysis is that one MUST put variables into context. This dataset looks “richer” than it actually is. While there are several variables that appear to relate to quality, when you do a little bit of reading about how wine is made, it becomes apparent that the relationships between variables are important! Primarily, the preservatives used in wine are there for a reason - they keep the wine from turning into vinegar and therefore help it stay higher quality!

Lesson 2 was that one must keep one’s personal biases in check! I can’t handle high-sulfite wines and while it’s not that uncommon for people to have allergy-like symptoms from sulfites, it appears to be uncommon enough that it doesn’t affect quality ratings. Of course, perhaps someone with this problem is much less likely to become a wine expert and therefore this view of high-sulfite wines won’t show up in expert-rated data sets!